ABSTRACT

Summary: RegaDB is a free and open source data management and
analysis environment for infectious diseases. RegaDB allows clinicians
to store, manage and analyse patient data, including viral genetic
sequences. Moreover, RegaDB provides researchers with a mechanism
to collect data in a uniform format and offers them a canvas to make
newly developed bioinformatics tools available to clinicians and
virologists through a user-friendly interface.

Availability and implementation: Source code, binaries and 
documentation are available on http://rega.kuleuven.be/cev/regadb. 
RegaDB is written in the Java programming language, using a
web-service-oriented architecture.
Contact: pieter.libin@rega.kuleuven.be 

Received on November 21, 2012; revised on March 13, 2013; 
accepted on April 1, 2013 

1 INTRODUCTION 

Advances in infectious diseases research require efficient
collaboration and exchange of clinical and virological data.

*To whom correspondence should be addressed. 



Researchers need access to large amounts of data to test hypotheses
or extract valuable information through data mining (Sloot et al.,
2008, 2009). For this purpose, RegaDB was developed as a free and
open source data management and analysis environment for infectious
diseases (Libin et al., 2007).

RegaDB runs on Windows, Linux or Mac OS X. The system 
can be installed within a hospital or institute so that the data 
stays within the clinical environment. RegaDB follows the idea 
of an integrated environment for bioinformatics analysis, such as 
the Genetic Data Environment (de Oliveira et al., 2003), ViroLab 
(Assel et al., 2009) and Geneious (Drummond et al., 2011). The 
difference is that RegaDB uses a relational database, and can be 
locally or remotely accessed. This allows RegaDB to be used for
clinical management and/or research in one locality or for
long-term data-sharing collaborations between different institutes.



2 DATABASE STRUCTURE AND TOOLS 

Fig. 1. An overview of RegaDB's database entities and functionalities

RegaDB's database enforces the data abstraction paradigm
(Fig. 1). This approach ensures flexibility, as the database can
be conveniently extended as needed without upgrading its
schema in most cases (Imbrechts et al., 2009). All abstract
data entities are connected to a central patient entity, including
attributes, tests, events, therapies and viral isolates. Attributes
annotate a patient with information, which is typically of a
clinical or epidemiological nature, e.g. the gender or transmission
risk group. RegaDB implements tests as values that are obtained
at a given moment in time, i.e. there is only one date associated
with each test. The results can be in vivo or in vitro measurements,
appointments, as well as computational results obtained from a
web service. General tests are used to store data extracted from
patient samples, e.g. cell counts and viral loads. Tests can also be
linked to viral isolates, e.g. typing and subtyping results, to
drugs, e.g. therapeutic drug monitoring, or to a combination of
an isolate and a drug, e.g. phenotypic and genotypic resistance
interpretations. Events cover a specific time interval in the
patient's history, i.e. they have a start and end date, e.g.
AIDS-defining illnesses or pregnancy. The default list of attributes,
tests and events available in the system can be extended via the user
interface. In this way, RegaDB can be tailored to the user's needs or
research interests. Attributes, tests and events are annotated with
a data type (numbers, strings, nominal values, etc.), which allows
the user interface and data access layer to maintain data integrity.
The therapy entity allows users to store the medication history of a
patient. A single therapy consists of a start date, a stop
date and a combination of drugs, i.e. a regimen, which the users
can select from a list of both generic and commercial drug
names. When the therapy has a stop date, the clinician can
indicate a reason for ending or switching the treatment, e.g.
resistance, side effects or adherence issues.
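The entities described above can be illustrated with a minimal Python sketch. This is not RegaDB's actual schema or Java code: all class and field names (ValueType, PatientTest, Event, Therapy, Patient) are hypothetical, chosen only to show how a single-date test, an interval event, a therapy with an optional stop reason and a typed value might relate to a central patient entity.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Tuple, Union

Value = Union[int, float, str]


@dataclass(frozen=True)
class ValueType:
    """Data type of an attribute, test or event (hypothetical names)."""
    kind: str                              # "number", "string" or "nominal"
    nominal_values: Tuple[str, ...] = ()   # allowed values for "nominal"

    def accepts(self, value: Value) -> bool:
        if self.kind == "number":
            return isinstance(value, (int, float)) and not isinstance(value, bool)
        if self.kind == "nominal":
            return value in self.nominal_values
        return isinstance(value, str)


@dataclass
class PatientTest:
    """A value obtained at a single moment in time (one date only)."""
    name: str
    value_type: ValueType
    value: Value
    test_date: date

    def __post_init__(self) -> None:
        # Reject values that do not match the declared data type.
        if not self.value_type.accepts(self.value):
            raise ValueError(f"{self.value!r} is not a valid {self.value_type.kind}")


@dataclass
class Event:
    """Something that covers a time interval in the patient's history."""
    name: str
    start: date
    end: date


@dataclass
class Therapy:
    """A regimen with a start date and an optional stop date and reason."""
    start: date
    drugs: List[str]
    stop: Optional[date] = None
    stop_reason: Optional[str] = None

    def __post_init__(self) -> None:
        # A stop reason only makes sense once the therapy has a stop date.
        if self.stop_reason is not None and self.stop is None:
            raise ValueError("a stop reason requires a stop date")


@dataclass
class Patient:
    """Central entity to which all other entities are connected."""
    patient_id: str
    attributes: Dict[str, Value] = field(default_factory=dict)
    tests: List[PatientTest] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)
    therapies: List[Therapy] = field(default_factory=list)


NUMBER = ValueType("number")
patient = Patient("P001", attributes={"gender": "female"})
patient.tests.append(PatientTest("Viral load", NUMBER, 50000, date(2012, 5, 1)))
patient.therapies.append(Therapy(date(2012, 5, 8), ["AZT", "3TC", "EFV"]))
```

Validating the value against its declared type at construction time mirrors the idea that the data type annotation is what allows data integrity to be maintained.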

A viral isolate contains one or more nucleotide sequences,
allowing multiple sequences extracted from one viral genome
to be grouped together. Once an isolate is added to RegaDB,
the corresponding pathogen is determined by invoking a web
service that implements a BLAST search procedure (Altschul
et al., 1990). When RegaDB supports the pathogen, the
appropriate reference sequence is loaded and used to perform a
codon-correct alignment with frame-shift detection and correction.
The alignment procedure finds the protein reading frames encoded by
the sequences that make up the isolate. This information,
together with all detected point mutations, insertions and deletions,
is stored in the database. The alignment web service implements
the Needleman-Wunsch algorithm in C++ (Needleman and
Wunsch, 1970) to analyse large sequences efficiently.
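For readers unfamiliar with the Needleman-Wunsch algorithm, the following Python sketch shows the basic global alignment recurrence on nucleotide sequences. It is a simplified illustration only: the scoring values (match, mismatch, gap) are assumptions, a linear gap penalty is used, and the codon-correct alignment and frame-shift handling of the RegaDB web service are not reproduced.

```python
def needleman_wunsch(ref, query, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned_ref, aligned_query)."""
    n, m = len(ref), len(query)

    def sub(i, j):
        # Substitution score for ref[i-1] against query[j-1].
        return match if ref[i - 1] == query[j - 1] else mismatch

    # score[i][j] = best score for aligning ref[:i] with query[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in the query
                score[i][j - 1] + gap,            # gap in the reference
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_ref, aligned_query = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_ref.append(ref[i - 1])
            aligned_query.append(query[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])
            aligned_query.append("-")
            i -= 1
        else:
            aligned_ref.append("-")
            aligned_query.append(query[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_ref)), "".join(reversed(aligned_query))


best, ref_row, query_row = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(ref_row)
print(query_row)
```

The table requires time and memory proportional to the product of the two sequence lengths, which is one reason a compiled implementation is attractive for large sequences.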



Depending on the pathogen determination returned by the web 
service, the viral isolate is directed to a typing web service 
(Alcantara et al., 2009; de Oliveira et al., 2005) and/or resistance 
interpretation web service (Liu and Shafer, 2006). Table 1 shows 
detailed information on reference sequences and bioinformatics 
tools available for the supported pathogens. RegaDB supports 
the use of bioinformatics tools published on the web as web 
services. 
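The routing of an isolate to the appropriate tools can be pictured as a lookup keyed on the detected pathogen. The sketch below encodes the contents of Table 1 in a Python dictionary; the function name plan_analyses and the step labels are hypothetical and do not correspond to RegaDB's actual web service interfaces.

```python
# Reference accession, typing tool and resistance algorithms per pathogen,
# transcribed from Table 1.
SUPPORTED_PATHOGENS = {
    "HIV-1": ("K03455", "Rega HIV Subtyping Tool", ("REGA", "HIVDB", "ANRS")),
    "HIV-2a": ("M15390", "Rega HIV Subtyping Tool", ("REGA", "ANRS")),
    "HIV-2b": ("U27200", "Rega HIV Subtyping Tool", ()),
    "HCV": ("AF009606", "Oxford HCV Subtyping Tool", ()),
    "HTLV": ("J02029", "LASP HTLV-1 Subtyping Tool", ()),
}


def plan_analyses(pathogen):
    """Return the ordered analysis steps for a detected pathogen.

    An unsupported pathogen yields an empty plan.
    """
    if pathogen not in SUPPORTED_PATHOGENS:
        return []
    reference, typing_tool, algorithms = SUPPORTED_PATHOGENS[pathogen]
    steps = [("align", reference), ("type", typing_tool)]
    steps.extend(("resistance", algorithm) for algorithm in algorithms)
    return steps


for step in plan_analyses("HIV-2a"):
    print(step)
```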

All data can be viewed and edited through a web-based interface.
Key parameters of a patient's clinical history are visualized
in a patient chart as a time-line annotated with viral loads, CD4
counts, regimens and viral isolate time points. RegaDB can
export patient details into a report document by replacing
variables in a user-designed RTF template.
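The report mechanism amounts to variable substitution in a template. A minimal Python sketch follows, under stated assumptions: the `$NAME` placeholder syntax and the variable names are invented for illustration (the syntax RegaDB actually uses is not described here), and values are assumed to be ASCII text, so only the RTF control characters are escaped.

```python
import re


def rtf_escape(text):
    """Escape the characters that have a special meaning in RTF."""
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def fill_template(template, values):
    """Replace $NAME placeholders (hypothetical syntax) with patient values.

    Placeholders without a matching value are left untouched.
    """
    def substitute(match):
        key = match.group(1)
        if key in values:
            return rtf_escape(str(values[key]))
        return match.group(0)

    return re.sub(r"\$([A-Z_]+)", substitute, template)


template = r"{\rtf1\ansi Patient: $PATIENT_ID\par Last viral load: $VIRAL_LOAD copies/ml}"
print(fill_template(template, {"PATIENT_ID": "P001", "VIRAL_LOAD": 50000}))
```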

Several tools are already available or are being developed,
some of them by users. Drug resistance interpretation can
be performed according to several algorithms. For HIV, various
versions of the Stanford algorithms (HIVdb, Liu and Shafer,
2006), the Rega algorithms (Van Laethem et al., 2002) and the
ANRS algorithms (Meynard et al., 2002) are implemented. For
each algorithm, a cumulative overview is available, whereby
resistance detected in a patient is taken forward to the last sample.
Evolution of a virus isolate is tabulated as amino acid changes
compared with the previous isolate from the same patient.
Another tool allows plotting a phylogenetic tree constructed
from a set of sequences with a pre-defined similarity to a query
sequence. To ensure the quality of the sequence database, a tool
was developed that can be used to flag potential contaminations,
errors in sampling or data entry, superinfection or transmission
chains, by detecting unusual intra- or inter-patient evolutionary
distances.
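Two of the tools above, the tabulation of amino acid changes and the cumulative resistance overview, can be sketched in a few lines of Python. The sketch assumes that protein sequences are already aligned to equal length and that resistance is reported on a three-level scale (S, I, R); both are simplifying assumptions, and the function names are hypothetical.

```python
def amino_acid_changes(previous, current):
    """List substitutions between two aligned protein sequences.

    Changes are written as <old residue><position><new residue>,
    with positions starting at 1.
    """
    if len(previous) != len(current):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{old}{position}{new}"
        for position, (old, new) in enumerate(zip(previous, current), start=1)
        if old != new
    ]


# Assumed ordering of interpretation levels, from susceptible to resistant.
LEVELS = ("S", "I", "R")


def cumulative_resistance(samples):
    """Carry the worst level seen for each drug forward to the last sample.

    `samples` is a chronologically ordered list of {drug: level} dicts.
    """
    worst = {}
    for sample in samples:
        for drug, level in sample.items():
            if drug not in worst or LEVELS.index(level) > LEVELS.index(worst[drug]):
                worst[drug] = level
    return worst


print(amino_acid_changes("PQITLWKR", "PQVTLWQR"))
print(cumulative_resistance([{"AZT": "R", "3TC": "S"}, {"AZT": "S", "3TC": "I"}]))
```

In the second call, AZT remains reported as resistant even though the later sample appears susceptible, which is the behaviour a cumulative overview is meant to provide.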

Attributes are synchronized with a central repository to ensure 
compatibility between different RegaDB instances. The central 
repository contains a collection of standardized data fields and 
corresponding values such as demographic information (country 
of origin, transmission risk group, etc.), test results (viral load, 
cell count, etc.) and drug names (both generic and commercial). 
In addition, this repository also provides access to the latest
versions of drug resistance algorithms. Compatibility functionalities
allow the system to be updated, with minimal effort, as new
content becomes available.



3 OPPORTUNITIES FOR RESEARCHERS 

When the development of RegaDB started, several custom-made 
databases were available that allowed users to enter ambiguous 
representations of data, for example, different representations for 
the same medical compound. However, to facilitate efficient data
exchange and to make the execution of aggregate queries possible,
it is important that data are available in a structured
format. By providing support for explicit data types and
enforcing these data types through the user interface, RegaDB
circumvents many difficulties that might complicate the exchange of
data.

RegaDB allows data to be exported in XML format from 
local data sources (hospitals, institutes), and these exports can 
be combined in a research database. 

Data from other databases can be imported via a generic
import tool. RegaDB also provides a programming interface,
which can be used to develop custom import programs to support
more complicated data sources. A procedure to import data
encoded in the HICDEP (hicdep.org) format directly into
RegaDB is currently under development.

Table 1. Pathogens currently supported by RegaDB, annotated with the
reference sequence used for alignments and with the subtyping and
resistance interpretation bioinformatics tools applied to new isolates
of the respective pathogen

Pathogen  Reference sequence    Genotyping                  ASI resistance
          (Genbank accession)                               interpretation
HIV-1     HXB2 (K03455)         Rega HIV Subtyping Tool     REGA, HIVDB, ANRS
HIV-2a    ROD (M15390)          Rega HIV Subtyping Tool     REGA, ANRS
HIV-2b    EHO (U27200)          Rega HIV Subtyping Tool
HCV       H77 (AF009606)        Oxford HCV Subtyping Tool
HTLV      HTLV-1 (J02029)       LASP HTLV-1 Subtyping Tool

A research database will generally be accessed via the Internet;
therefore, authentication is an important security aspect.
RegaDB supports password-based authentication by default.
The authentication module abstraction allows for a straightforward
implementation of alternative authentication back-ends
(OpenID, Kerberos, etc.), which makes it possible for RegaDB
to connect to existing user management systems. The application
will only allow registered users to access the system. Once
granted access to the system, a user is only able to access patient
information that belongs to a dataset connected to the user's
profile. The owner of the dataset can configure the access of
users to this dataset, and revoke the access after a certain analysis
or assignment is finished.
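The dataset-based access rule can be summarised in a short Python sketch. The class and function names (Dataset, can_view) are hypothetical and the model is deliberately reduced to owner, granted users and patient membership; RegaDB's actual authorization code is not shown here.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class Dataset:
    """A named collection of patients with an owner and granted users."""
    name: str
    owner: str
    patient_ids: Set[str] = field(default_factory=set)
    users: Set[str] = field(default_factory=set)

    def _require_owner(self, acting_user: str) -> None:
        if acting_user != self.owner:
            raise PermissionError("only the dataset owner can change access")

    def grant(self, acting_user: str, user: str) -> None:
        self._require_owner(acting_user)
        self.users.add(user)

    def revoke(self, acting_user: str, user: str) -> None:
        self._require_owner(acting_user)
        self.users.discard(user)


def can_view(user: str, patient_id: str, datasets: Iterable[Dataset]) -> bool:
    """A user sees a patient only through a dataset connected to that user."""
    return any(
        patient_id in dataset.patient_ids
        and (user == dataset.owner or user in dataset.users)
        for dataset in datasets
    )


cohort = Dataset("cohort-a", owner="alice", patient_ids={"P001", "P002"})
cohort.grant("alice", "bob")
print(can_view("bob", "P001", [cohort]))
cohort.revoke("alice", "bob")
print(can_view("bob", "P001", [cohort]))
```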

Researchers can query RegaDB using the visual query tool,
which allows users to define complex queries guided by a user
interface. Query definitions can be saved and re-run every time
an update of the data becomes available. Work is in progress to
support the use of predefined SQL-based queries via the user
interface. Query results can be exported to a CSV and/or
FASTA file. It is possible to set up an analysis workflow by
configuring a query to execute a Python post-processing script.
If the script generates statistical data in a graphical format, this is
visualized in the query user interface after the query has been
executed.
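As an illustration of such a workflow, the following Python sketch summarises a CSV query result. How RegaDB hands the query result to the script is not specified here, so the sketch simply takes the CSV content as text; the column names patient_id and viral_load are assumptions, and the output is a plain summary rather than a graphic.

```python
import csv
import io
import statistics


def summarize_column(csv_text, column="viral_load"):
    """Return the count and median of a numeric column in CSV query results.

    Rows with an empty or missing value in the column are skipped.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows if row.get(column)]
    if not values:
        return {"n": 0, "median": None}
    return {"n": len(values), "median": statistics.median(values)}


# Hypothetical query export with one missing measurement.
query_result = (
    "patient_id,viral_load\n"
    "P001,50000\n"
    "P002,400\n"
    "P003,\n"
    "P004,1200\n"
)
print(summarize_column(query_result))
```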

When researchers make their tools available as web services, 
they can be easily integrated in RegaDB, lowering the threshold 
for clinicians and virologists to use such tools. 

RegaDB has been used in several collaborations including the
ViroLab EC project (virolab.org). Data from several European
hospitals were stored in one RegaDB instance, resulting in a
combined dataset of >8000 sequences. During the last phase of
the project, we were able to combine our efforts with another EC
project, EUResist (euresist.org), resulting in a combined
database of >55000 sequences.

Another example of the utility of RegaDB is the collaborative
database used within the Southern African Treatment and
Resistance Network (SATuRN). This network has 24 member
institutions working in Southern Africa, the region at the
epicentre of the HIV epidemic. Currently there are >10
institutions using the SATuRN RegaDB for patient data management,
data curation and research. Under SATuRN, >7000 genotypes
with treatment and monitoring data have been collected. Using
the built-in customized report and query functionality, data of
specific attributes are selected, analysed and used to answer
specific clinical and research questions (de Oliveira et al., 2010;
Manasa et al., 2012). In addition, members of the SATuRN
project recently published a book (Rossouw et al., 2013)
containing a series of case studies used for training. More than 1450
physicians and nurses have been trained through conferences,
workshops and online web-tutorials.

4 AVAILABILITY AND USAGE 

RegaDB is a software application that can be downloaded from 
the Internet and installed in a health care or research institute. 
Documentation, source files and binaries are available on 
http://rega.kuleuven.be/cev/regadb. Because of its modular and
flexible design, RegaDB can be used in many different contexts
and settings, from managing patient data in a clinical
environment to setting up large-scale research collaborations.
Currently, all RegaDB instances are private instances that can only
be accessed by a restricted user base. Some of these instances are
accessible on the Internet; others are only accessible from within
the institute's intranet.

The current version of the software is already used for storing 
genetic data of HIV-1, HIV-2, HTLV (Araujo et al., 2012) and 
HCV isolates and related patient and clinical information. 